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Abstract 

A set of rules is defined to systematically number the groups and the 
atoms of organic molecules and, particularly, of polypeptides in a modular 
manner. Supported by this numeration, a set of internal coordinates is 
defined. These coordinates (termed Systematic, Approximately Separable 
and Modular Internal Coordinates, SASMIC) are straightforwardly writ- 
ten in Z-matrix form and may be directly implemented in typical Quantum 
Chemistry packages. A number of Perl scripts that automatically generate 
the Z-matrix files for polypeptides are provided as supplementary mate- 
rial. The main difference with other Z-matrix-like coordinates normally 
used in the literature is that normal dihedral angles ( "principal dihedrals" 
in this work) are only used to fix the orientation of whole groups and a 
somewhat non-standard type of dihedrals, termed "phase dihedrals", are 
used to describe the covalent structure inside the groups. This physical 
approach allows to approximately separate soft and hard movements of 
the molecule using only topological information and to directly imple- 
ment constraints. As an application, we use the coordinates defined and 
ab initio quantum mechanical calculations to assess the commonly as- 
sumed approximation of the free energy, obtained from "integrating out" 
the side chain degree of freedom x, by the Potential Energy Surface (PES) 
in the protected dipeptide HCO-L-Ala-NH2. We also present a sub-box 
of the Hessian matrix in two different sets of coordinates to illustrate the 
approximate separation of soft and hard movements when the coordinates 
defined in this work are used. 

PACS: 87.14.Ee, 87.15.-v, 87.15.Aa, 87.15. Cc, 89.75.-k 
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1 Introduction 



The choice of the coordmatcs used to describe a molecule is an important issue if 
computational considerations are to be taken into account and the efficiency of 
the simulations is pursued. This choice also affects the coding of applications. If 
cumbersomely defined coordinates are used, an unnecessary complexity may be 
added to the design of Monte Carlo movements, the construction and pruning 
of a database of structures [1,2] or the programming of molecular visualization 
and manipulation tools. 

Suitable coordinates frequently used to describe arbitrary conformations of 
molecules are the so-called "internal" or "valence- type" coordinates [3]. Their 
adequacy stems from a number of characteristics: first, they are closely related 
to chemically meaningful structural parameters, such as bond lengths or bond 
angles; second, they are local, in the sense that each one of them involves only 
a small number of (normally close) atoms in its definition; and finally, there are 
only 3N — 6 of them (where A'^ is the number of atoms in the molecule), in such 
a way that the overall rotation and translation have been naturally removed. 

There also exists a family of coordinates [4-7], extensively used in the in- 
ner calculations of many Quantum Chemistry packages (such as Gaussian [8] 
or GAMESS [9]) and based on the "natural internal coordinates" originally 
proposed by Pulay and coworkers [10], which are defined through linear com- 
binations of the original internals. These coordinates are specially designed 
to describe normal-mode vibrations in the immediate neighbourhood of energy 
minima and represent the best choice for accelerating convergence of geome- 
try optimizations in a particular basin of attraction, via diagonal estimation 
of the Hessian matrix [7]. Accordingly, they maximally separate hard and soft 
movements in these conditions. However, if the conformation of the molecule 
is far from a minimum, this type of coordinates lose great part of their mean- 
ing and they introduce many computational difficulties without increasing the 
efficiency. Also, some of the definitions are redundant [6, 10], i.e., they use a 
number of coordinates larger than the number of degrees of freedom. In this 
work, we will only discuss coordinates, such as internals or Cartesian, that may 
be conveniently used to specify an arbitrary conformation of the system and 
that can be directly related to simple geometrical variables. 

In macromolecules, such as proteins, the number of degrees of freedom is 
the main limiting factor when one tries to predict their behaviour via computer 
modeling. Therefore, it is also advisable that the set of coordinates chosen 
allows for a direct implementation of physically meaningful constraints that re- 
duce the dimensionality of the conformational space considered. Most of the 
expressions used in Statistical Mechanics or in Molecular Dynamics are best 
written in Cartesian coordinates, however, the implementation of constraints 
naturally appearing is far from being straightforward in these coordinates. In 
internal coordinates, on the contrary, the approximate separation of hard and 
soft movements of the system allows to easily constrain the molecule [11] by 
setting the hard coordinates (those that require a considerable amount of en- 
ergy to change noticeably) to constant values or to particular functions of the 
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soft coordinates. Moreover, in internal coordinates (and appealing to some rea- 
sonable approximations), the Statistical Mechanics formulae for the constrained 
system may be written in convenient closed form [12, 13]. 

Still, although the bond lengths and bond angles are customarily regarded 
as hard and their definition is unproblematic, the same is not true for dihe- 
dral angles. Some definitions of dihedrals may lead to difficulties or to worse 
separation of hard and soft modes. Let us exemplify this with a particular case: 

Consider the definition of Z-matrix-like [14] internal coordinates for the 
HC0-L-Ala-NH2 molecule in fig. [H Imagine that we "position" (i.e., we write 
the corresponding Z-matrix row) every atom up to the hydrogen denoted by Hg 
and that we are now prepared to position the hydrogens in the side chain (Hio, 
Hii and H12) via one bond length, one bond angle and one dihedral for each 
one of theno. A choice frequently seen in the literature [1,2, 15-17] is the one 
shown in table [TJ 



Atom name Bond length Bond angle Dihedral angle 



Hio 


(10,8) 


(10,8,5) 


71 


:=(10,8,5,3) 


Hii 


(11,8) 


(11,8,5) 


72 


:=(11,8,5,3) 


H12 


(12,8) 


(12,8,5) 


73 


:=(12,8,5,3) 



Table 1: A part of the internal coordinates, in Z-matrix form, of the protected dipep- 
tide HCO-L-Ala-NH2, as frequently defined in the literature. 

If we now perform the gedanken experiment that consists of taking a typi- 
cal conformation of the molecule and slightly moving each internal coordinate 
at a time while keeping the rest constant, we find that any one of the three 
dihedrals in the previous definition is a hard coordinate, since moving one of 
them while keeping the other two constant distorts the internal structure of the 
methyl group. Hence, in these coordinates, the soft rotameric degree of freedom 
X, which we know, for chemical arguments, that must exislQ, is ill-represented. 
In fact, it must be described as a concerted movement of the three dihedrals. 
In references [1,2], this fact is recognized and the concept of "related dihe- 
drals" is introduced, however, no action is taken to change the definition of the 
coordinates. 

In this work, using the ideas of R. Abagyan and coworkers [11], we define a set 
of rules to uniquely and systematically number the groups, the atoms and define 
the internal coordinates of organic molecules and, particularly, of polypeptide^. 

^We will denote by the bond length between atoms i and j\ by {i,j,k), the bond 

angle between the vectors r^j. and rji; and by {i,j,k,l) the dihedral angle between the plane 
defined by the atoms i, j and k and the one defined by j, k and I. 

■^According to our calculations, at the RHF/6-31+G(d) level of the theory, the barrier for 
crossing from one of the three equivalent minima to any of the other two ranges from 3.1 to 
6.8 kcal/mol, depending on the values of the Ramachandran angles (p ^rid tp. Compare with 
the barriers in (/> or which may be as large as 20 kcal / mol depending on the region of the 
Ramachandran map explored. 

•^lUPAC conventions only define a numeration system for the groups, for the branches and 
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Principal 
diiiedral 

(4,3,2,1) 



Phase 
dihedral 

(4,3,2,1) 



Figure 1: Two types of dihedral angles, a) Principal dihedral. Used to describe 
the rotation of whole groups around bonds, b) Phase dihedral. Used to describe the 
internal covalent structure of groups. The positive sense of rotation is indicated. 

The main difference with other Z-matrix-like coordinates normally used in the 
literature [1,2, 15-17] is that, instead of positioning each atom with a bond 
length, a bond angle and a dihedral angle, we use normal dihedral angles (called, 
from now on, "principal dihedrals" ) only to fix the orientation of whole groups 
and a somewhat non-standard type of dihedrals, termed "phase dihedrals" by R. 
Abagyan and coworkers [11] (see fig.[I|), to describe the covalent structure inside 
a groupH. This allows to approximately separate soft and hard movements of the 
molecule using only topological information (i.e., not knowing the exact form 
of the potential) and to easily implement constraints by forcing the coordinates 
that correspond to hard movements to take constant values or ones that depend 
on the soft coordinate^. 

In addition, the coordinates herein defined, are straightforwardly cast into 
Z-matrix form and may be directly implemented in any Quantum Chemistry 
package, such as Gaussian [8] or GAMESS [9]. This is due to the fact that, 

for some selected dihedral angles. They focus on functional considerations and not in compu- 
tational problems. For related documents and references, see |http : //www ■ chem . qmul ■ ac ■ uk/ 1 
iupac/jcbn/. 

^Another option may bo to use, as a third internal coordinate for each atom, another bond 
angle. This is rather awkward, however, since two bond angles and a bond length do not 
specify the position of a point in space. Any values of these three coordinates (except for 
irrelevant degenerate cases) are compatible with two different symmetrical positions and a 
fourth number must be provided to break the ambiguity. 

^In reference [18] they correctly take this approach into account using out-of-plane angles 
instead of phase dihedrals, however, they do not describe any rules for a general definition 
and their numeration of the atoms is non-modular, as it proceeds first through the backbone 
(see sec. |3j. 
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although they involve atoms whose covalent structure is different, the mathe- 
matical construction of the two types of angles in fig. [1] is exactly the same, and 
the phase dihedrals are treated like principal ones without any problem by the 
applications. 

A number of Perl scripts are provided that number the atoms and generate 
the coordinates herein defined for polypeptide chains. The applications read a 
sequence file in which the different ionization states of the titratable side chains, 
the tautomeric forms of Histidine and several terminal groups may be specified. 
Then, an output file is generated with the symbolic definition of the Z-matrix 
of the molecule which may be directly pasted into the input files of Gaussian [8] 
or GAMESS [9] (and, upon slight modifications, of any Quantum Ghemistry 
package that is capable of reading Z-matrix format). These scripts may be 



found at http://neptuno.unizar.es/files/public/gen_sasmic/ 



Now, if we redo the example in tabled] using phase dihedrals, we must write 
the rows of the Z-matrix for the hydrogens in the side chain as shown in table O 

Atom name Bond length Bond angle Dihedral angle 



Hio 


(10,8) 


(10,8,5) 


X ■ 


= (10,8,5,3) 


Hn 


(11,8) 


(11,8,5) 




:=(11,8,5,10) 


Hi2 


(12,8) 


(12,8,5) 


a2 


:=(12,8,5,10) 



Table 2: A part of the internal coordinates, in Z-matrix form, of the protected dipep- 
tide HCO-L-Ala-NH2, as defined by the rules given in sees. [5] and [3] 

Where the angle (10,8,5,3) is now the principal dihedral x describing the 
relative rotation of the methyl group around the bond length (8,5) and the 
other two arc phase dihedrals that describe the internal structure of the group 
and that arc pure hard coordinates (as far as can be told only from topological 
information). However, one must point out that, although all bond lengths, 
bond angles and phase dihedrals may be regarded as hard coordinates, not all 
the principal dihedrals will be soft. Examples of hard principal dihedrals are 
the ones that describe the rotation around a double bond (or a triple one) or 
some of the principal dihedrals in cyclic parts of molecules. 

The physical approach described in this section, which should be taken into 
account when designing internal coordinates, is embodied in a set of rules for any 
organic molecule in sec. [21 with a slightly different prescription for polypeptide 
chains in sec. [3] The systematic numeration introduced facilitates the compu- 
tational treatment of this type of systems and the rules given for polypeptide 
chains ensure modularity [1,19], i.e., allows to add any residue with minimal 
modification of the already existing notation and to easily construct databases 
of structures or of Potential Energy Surfaces (PES). 

The characteristics aforementioned have led us to term the coordinates 
herein defined Systematic, Approximately Separable and Modular Internal Co- 
ordinates (SASMIG). 

In this work, we will only deal with the numeration of one isolated molecule. 
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however, the procedure described may be easily generahzed (and will be in 
future works) to systems of many molecules (an important example being a 
macromolecular solute in a bath of solvent molecules). This could be done using 
ghost atoms in a similar manner to what is done in ref. [12], to position the center 
of mass of the system, and in refs. [11], to actually define the coordinates of a 
system of molecules. 

Finally, in sec. [4l we use the new coordinates and ab initio quantum me- 
chanical calculations in order to evaluate the approximation of the free energy, 
obtained from "integrating out" the rotameric degree of freedom via the 
typical PES in the protected dipeptide HC0-L-Ala-NH2. This will be relevant 
to design effective polypeptide potentials. We also present a small part of the 
Hessian matrix in two different sets of coordinates to illustrate the approximate 
separation of soft and hard movements when the SASMIC defined in this work 
are used. Sec. [5] is devoted to the conclusions. 



Figure 2: Schematic representation of the groups found in proteins. From left to 
right: tetrahedral, triangular and linear. 

First, we realize that any molecule may be formally divided in groups such 
as those in fig. ^ We will call "centers" the shaded atoms in the figure and 
"vertices" the white ones. In general, there may exist groups with more than 
four vertices, however, in proteins, only groups with four or less vertices oc- 
cuj^l. Examples of tetrahedral groups are the one whose center is the Ca in the 
backbone or the Cg in the side chain of alanine, triangular groups occur, for 
example, at the N or the C in the backbone, finally, linear groups may be found 
at the O in the side chain of tyrosine or at the S in methionine (see fig. [TU)) . 

A particular atom may be vertex of different groups but may only be center 
of one group. There exist atoms that are only vertices but there do not exist 
atoms that are only centers, except in the case of molecules with only one group. 
In the trivial case of diatomic molecules (in which the only internal coordinate is 
a bond length), neither of the previous definitions are possible, since we cannot 
identify a group. 

fact that will not be used in the definitions. 



2 General numeration rules 
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Atoms that are covalently bonded to more than one atom wih be called 
"internal atoms" and are indicated as shaded circles in fig. [3l Atoms that are 
covalently attached to only one internal atom will be called "external atoms" 
and are indicated as white-filled circles in fig. [31 In proteins, only H and O may 
be external atoms. 




Figure 3: Schematic representation of the HCO-L-His-NH2 model dipeptide (with the 
side chain in its uncharged 5 tautomeric form). Internal atoms are shown as gray-filled 
circles, external ones as white-filled circles. Internal bonds are indicated with curved 
arrows. Typical biochemical definitions of some principal dihedrals are also shown. 

In most macromolecular models (such as the Born-Oppcnhcimer approxima- 
tion used in sec. [4]), nuclei are considered point-like particles. Hence, rotation 
around bonds joining external and internal atoms (termed "external bonds" or 
"non-dihedral bonds") is neglected, i.e., there are no internal coordinates as- 
sociated to this movement. On the other hand, rotation around bonds joining 
two internal atoms (called "internal bonds" or "dihedral bonds" and indicated 
with curved arrows in fig. [3]) is relevant and there may exist internal coordinates 
describing it. 

In order to conform with the physical approach stated in the introduction, 
only one golden rule must be followed when defining the internal coordinates: 

One principal dihedral, at most, must be defined on each internal 
bond. 

The rest of the rules that will be given are mere tidy conventions and sys- 
tematics. 
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2.1 General rules for numbering the groups 



First of all, we will divide the molecule in groups and number them. To do 
this we proceed by branches, i.e., we choose the next group following a linear 
sequence of covalently attached groups until there is no possible next one, in 
which case, we either have finished the numeration process or we start another 
branch. Every group is numbered one time and it cannot be renumbered as the 
process continues. This numeration is done for completeness and as a support 
for the numeration of atoms and coordinates. 

In fig. [4^, we have implemented these general rules in a protected histidine 
dipeptide. Later, in sec. [31 while stating the rules for polypeptides, we will 
remark the differences (which will lead to fig. [IJd) and show the reasons for the 
special prescriptions using the same example. 



Figure 4: Group identification and numeration in the protected dipeptide HCO-L- 
His-NH2 (with the side chain in its uncharged S tautomeric form), a) Following the 
general rules, b) Following the special rules for polypeptides. The different types of 
groups are shown as gray-filled polyhedra. 

The rules arc as follows: 

i) The first group (j = 1), is chosen, among those that are linked to the 
molecule via only one internal bond (termed "terminal groups"), as the 
one that has the greater mas^. If two or more terminal groups have the 
same mass, we add the mass of their first neighbours to break the tie. If 
this does not lead to a decision, we proceed to the second neighbours and 
so on. If we run out of neighbours and there is still a tie, we choose a group 
arbitrarily among the ones that have been selected via this process and 
we indicate the convention. If there are no terminal groups, we perform 

^The mass of a group is defined as the sum of the atomic masses of its constituents. 



a) 
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this selection process among those groups that have at least one external 
aton^. 

ii) If there is only one unnumbered group linked to group j, we number it as 
j + 1, set j = j + 1 and go to (ii). 

iii) If there are two or more unnumbered groups linked to group j, we choose 
the one with the greater mass as in point (i), we number it as j + I, set 
j = j + 1 and go to (ii). 

iv) If there are no unnumbered groups linked to group j but there are still 
unnumbered groups in the molecule, we set j to the number of the low- 
est numbered group that has unnumbered neighbours (we prepare to start 
another branch) and we go to (ii). 

This process terminates when all the groups are numbered. 

2.2 General rules for numbering the atoms 

The atoms will be numbered in the order that they will be positioned via internal 
coordinates in the Z-matrix. As in the previous subsection, in fig. O these 
general rules, as well as the special rules for polypeptides, are exemplified in a 
protected histidine dipeptide. In sec. [31 we will remark the differences and show 
the advantages of slightly modifying the prescription. 




Figure 5: Atom numeration of the protected dipeptide HCO-L-His-NH2 (with the 
side chain in its uncharged 5 tautomeric form), a) Following the general rules, b) 
Following the special rules for polypeptides. 

The rules are as follows: 

**The rare case in which there are neither terminal groups nor external atoms (such as 
Ceo fuUcrene) will not be treated here, although it would require only a small number of 
adjustments to the rules. 
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i) The first atom (fc = 1), is chosen as the heaviest of the external atoms in 
the first group. If there are two or more candidates with the same mass, 
we choose arbitrarily and indicate the convention. 

ii) The second atom {k = 2) is the center of the first group and we set j = 1 
(the index of the group). 

iii) If group j + 1 exists and is covalcntly attached to group j, wc number 
the unnumbered vertices of group j starting by the center of group J + 1 
and, then, in order of decreasing mass. If, otherwise, group j + 1 does 
not exist or it is not covalcntly attached to group j, we simply number 
the unnumbered vertices of group j in order of decreasing mass. If, at any 
point, there are two or more candidates with the same mass, we choose 
arbitrarily and indicate the convention. Exception: If groups j and j + 1 
belong to the same cyclic part of the molecule, the vertices of j that are 
centers of groups (other than j + 1) belonging to the same cycle must not 
be numbered at this step (for an example of this rule, see the numeration 
of Ci7 and N22 in fig. [S^, or C13 and Nig in fig.[S)D). 

iv) If group J + 1 does not exist, we have finished. Otherwise, we set j = j + 1 
and go back to (iii). 

2.3 General rules for defining the internal coordinates 

Using the numeration for the atoms given in the previous section, we give now a 
set of rules for defining the internal coordinates that conform with the physical 
approach discussed in the introduction of this work. The coordinates are written 
in Z-matrix form (see table [3]) for convenience and the rules are applied to the 
protected dipeptide HC0-L-His-NH2 (with the side chain in its uncharged S 
tautomeric form) using the general numeration given in fig. [5^. 
The rules are as follows: 

i) The positioning of the first three atoms is special. The corresponding rows 
of the Z-matrix are always as the ones in table [3] (except, of course, for 
the chemical symbol in the first column, which may change). 

ii) The positioning of the remaining vertices of group number 1 (if there is 
any) is also special, their rows in the Z-matrix are: 

T, (i,2) (*,2,3) («,2,3,1) 

Where T is the chemical symbol of the i-th atom, and (z, 2, 3, 1) is a phase 
dihedral. 

iii) We set i to the number that follows that of the last vertex of the first group. 

iv) We choose j as the lowest numbered atom that is covalcntly linked to i. 

v) We choose k as the lowest numbered atom that is covalcntly linked to j. 
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Atom name Bond length Bond angle Dihedral angle 



Oi 








C2 


(2,1) 






N3 


(3,2) 


(3,2,1) 




H4 


(4,2) 


(4,2,3) 


(4,2,3,1) 


C5 


(5,3) 


(5,3,2) 


(5,3,2,1) 


He 


(6,3) 


(6,3,2) 


(6,3,2,5) 


C7 


(7,5) 


(7,5,3) 


(7,5,3,2) 


Cg 


(8,5) 


(8,5,3) 


(8,5,3,7) 


H9 


(9,5) 


(9,5,3) 


(9,5,3,7) 


Nio 


(10,7) 


(10,7,5) 


(10,7,5,3) 


On 


(11,7) 


(11,7,5) 


(11,7,5,10) 


H12 


(12,10) 


(12,10,7) 


(12,10,7,5) 


Hx3 


(13,10) 


(13,10,7) 


(13,10,7,12) 


Cx4 


(14,8) 


(14,8,5) 


(14,8,5,3) 


Hi5 


(15,8) 


(15,8,5) 


(15,8,5,14) 


H16 


(16,8) 


(16,8,5) 


(16,8,5,14) 


Cx7 


(17,14) 


(17,14,8) 


(17,14,8,5) 


N18 


(18,17) 


(18,17,14) 


(18,17,14,8) 


Hl9 


(19,17) 


(19,17,14) 


(19,17,14,18) 


C20 


(20,18) 


(20,18,17) 


(20,18,17,14) 


H21 


(21,18) 


(21,18,17) 


(21,18,17,20) 


N22 


(22,20) 


(22,20,18) 


(22,20,18,17) 


H23 


(23,20) 


(23,20,18) 


(23,20,18,22) 



Table 3: Internal coordinates in Z-matrix form of the protected dipeptide HCO-L-His- 
NH2 (with the side chain in its uncharged S tautomeric form), following the general 
rules. Principal dihedrals are indicated in bold face. 

vi) If no principal dihedral has been defined on the bond (j. A: we choose / 
as the lowest numbered atom that is covalently linked to k. Otherwise, we 
choose / as the second lowest numbered atom that is covalently linked to 
j (i.e., the lowest numbered atom that is covalently linked to j and that 
is different from fc, or, equivalcntly, the atom that was used to define the 
only principal dihedral on the bond (j, fc)). 

vii) The row of the Z-matrix that corresponds to atom i is: 

Ti (?,j) {i,j,k) {i,j,k,l) 

Where T is the chemical symbol of atom i, {i,j) is a bond length, {i,j, k) 
is a bond angle and (i,j,k,l) is a principal dihedral if the first case in 
point (vi) has occurred or a phase dihedral otherwise. 

^We say that a principal dihedral {i,j,k,l) is "on the bond {j,ky\ 
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viii) If i + 1 docs not exist, we have finished. Otherwise, we set i = i + 1 and 
go to (iv). 

3 Special rules for polypeptides 

The numeration of the groups in a polypeptide chain is the same as in sec. 12.11 
except for some details: 

• Instead of using rule (i) for choosing the first group, we seleclF°l: 

— The amino group at the N-terminus (either charged or not) if the 
polypeptide is not N-protected. 

— The formyl group at the N-tcrminus if the polypeptide is formyl-N- 
protected. 

— The methyl group at the N-tcrminus if the polypeptide is acetyl-N- 
protccted. 

This is done because the primary structure of a polypeptide is normally 
presented from the N- to the C-terminus. In the case of IIC0-L-His-NIl2 
in fig. m the agreement between rule (i) and this one is accidental. 

• When we must choose the next group to the one whose center is a Ca in the 
backbone, instead of applying rule (iii), which would yield the group at the 
C as the next one (compare fig.H^ and[3]3), wc choose the first group in the 
side chain (for residues that arc different from glycine). Then, wc resume 
the numeration as usual. This is done in order to ensure modularity, 
since, if we numbered following rule (iii), the backbone would be always 
numbered first and the whole numeration would have to be modified if we 
added a new residue to the chain. 

• Also for modularity reasons, we want to completely number the side chain 
before proceeding into the backbone. Hence, if we are numbering side 
chain atoms and the requirements to apply rule (iv) are fulfilled, instead 
of applying this rule, we set j to the number of the lowest numbered group 
that has unnumbered neighbours and that belongs to the side chain of the 
residue whose groups we are numbering. Then, we go to (ii) as usual. 

The numeration of the atoms in a polypeptide chain is the same as in sec. 12.21 
except for some details: 

• We seek that the principal dihedrals that are to be defined after numbering 
the atoms conform to the biochemical lUPAC conventions for the dihedrals 
(j), ip and uj in the backbone. At the termini, we want that the atom where 
the Cq of the hypothetical residue or iV + 1 would occur is used to define 
the principal dihedrals. The general rules ensure this, except in two cases 
where a special convention must be given: 

^"These three cases are the most frequent. If a different species is used to N-protect the 
polypeptide chain, a convention must be sought that also starts at the N-terminus. 
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— If the polypeptide is formyl-N-protected, instead of applying rule (i) 
to choose the first atom, which would give the oxygen at the formyl 
group, we choose the hydrogen at the formyl group (compare fig. [S^ 
and[S)D). Then, we resume the numeration at rule (ii). 

— If the polypeptide is amide-C-protected, instead of applying rule (iii) 
and arbitrarily choosing one of the hydrogens in the terminal amide 
group before the other, we number the trans hydrogen before the 
other (compare fig. andO))- 

• Due to the special rules for the numeration of groups given above, the next 
group to the one at the Cq, is the first one in the side chain. If we applied 
rule (iii) for numbering the vertices of the Ca-group, we would number 
first the center of the first group at the side chain and, then, the C in the 
backbone. This would make the only principal dihedral defined on bond 
(Cq.,N) different from the conventional Ramachandran angle cj). In order 
to avoid this, instead of applying rule (iii) at this point, we number the C 
first among the unnumbered vertices of the Ca-group and, then, resume 
the usual numeration process. 

See fig. [To] for the numeration of the twenty naturally ocurring amino acids 
with formyl-N- and amide-C-protection. 

Once the groups and the atoms have been numbered following these special 
prescriptions, the definition of the internal coordinates for polypeptide chains is 
identical to the one described for the general case in sec. 12.31 

4 Application 

4.1 Theory 

When a number of degrees of freedom are removed from the description of the 
conformations of a physical system via their integrating out in the partition 
function, the energy function that remains, which describes the behaviour of 
the system only in terms of the rest of the degrees of freedom, is a free energy. 
It depends on the temperature and contains the entropy of the information that 
has been averaged out as well as the enthalpy. However, it is frequent, when 
studying the conformational preferences of model dipeptides in order to use 
the information for designing effective potentials of polypeptides [20-24], that 
the energy of these molecules be approximated by the Potential Energy Surface 
(PES) in the bidimensional space spanned by the Ramachandran angles cf) and 
•0 [18,24-26]. If we recognize that the potential energy of the system in the 
Born-Oppenheimer approximation (denoted by V^n-q) depends on the 3iV — 6 
internal coordinates, this surface (denoted by V2) may be defined as: 

V2{^,lb) -.^ min V3N-6i<f>.i^,Q") ■ (1) 
w 

Where denotes the rest of the internal coordinates. 
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The use of this surface, instead of a free energy function with the degrees 
of freedom integrated out, is justified in the approximation that these internal 
coordinates are hard and that they are comparably much more difficult to excite 
at room temperature than (j) and ip. If we assume that this is correct, these hard 
degrees of freedom may be easily eliminated [12] and the partition function of 
the system may be written as follows: 

Z = C j d(/idV'dQ"e-^^^"-''('^''^''5°) ~^'/ #d?/'e-'^^^('^''^) . (2) 
Where (3 := l/RT. 

Note however that, in the "flexible" picture for the constraints, this expres- 
sion is correct only if we assume that the Jacobian determinant of the change of 
coordinates from Cartesians to {</>, ■0, Q"} and the determinant of the potential 
second derivatives matrix with respect to the hard coordinates, both evaluated 
at the equilibrium values, do not depend on and ^ (see ref. [12]). If, alterna- 
tively, we accept the "rigid" picture for the constraints, we must ask that the 
determinant of the induced metric tensor in the constrained sub-manifold do not 
depend on and '0 [27] . If these approximations (which will be reexamined in 
future works) do not hold but the hardness of the Q" degrees of freedom is still 
assumed, the expressions in eq. [2] must be modified by adding some correction 
terms to V2[(f), 

In eq. [2] for the partition function, one also may see that, apart from the 
different multiplicative constants C and C", which do not affect the expected 
values of observables, the use of the PES V2{(j),tp) as the fundamental energy 
function of the system is justified because it plays the same role as the whole 
potential energy of the system in the first integral. 

However, although the hardness of the bond lengths, the bond angles and 
even the dihedral w in the peptide bond may be assumed, this is not a good 
approximation for the rotamcric degrees of freedom in the side chains of residues. 
In the frequently studied [18,25] example of HC0-L-Ala-NH2 (see fig. H]), as it 
has already been said in footnote [2l the side chain degree of freedom x must 
be regarded as soft. Still, although it is more complex, a soft degree of freedom 
may also be averaged out if it is considered convenient. 

In this section, we will assume that the energy of the formyl-alanine-amide 
dipeptide may be correctly approximated by a Potential Energy Hypersurface 
(PEH) (denoted by V3) that depends on the Ramachandran angles cj) and ■0 but 
also on the principal dihedral x that describes the rotation of the methyl group 
in the side chain. Analogously to eq. [H its definition in terms of the whole 
energy of the system is: 

^3(0,0,%) :=miny3Ar-6(0,^,X,Q'") • (3) 

Where Q'" represents the internal coordinates that are not (f>, tp or x- 
Note, in addition, that the two definitions are related by the following ex- 
pression: 
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Figure 6: Atom numeration of tiie protected dipeptide HC0-L-Ala-NH2 following the 
special rules for polypeptides. 



T/2(0,V')=minV^3(0,V',x) • (4) 

X 

We will also assume for V3{(j),ip,x) the aforementioned approximations that 
lead to eq.[2l in such a way that we can write (deUberately omitting the irrelevant 
multipUcative constants): 



d(/>di/'dxe-''^='('^''^''^) = J d(AdV'^(0,V) := 
d^dV-e-''^^^'^) . (5) 
Where we have defined: 

Z(</), V) := e-^^^^'^^ ■= I dxe-''^^('^''^'X) . (6) 



This is what must be done in general when a soft degree of freedom is 
needed to be integrated out in Statistical Mechanics [28] and the approximations 
in ref. [12] cannot be made. The function F{(f>,ip) is a free energy because, in 
general, it depends on the temperature and it contains the entropy of the degree 
of freedom x whose influence has been averaged. 

Wc must remark at this point that, to integrate out the side chain angle 
X could be reasonable if one's aim is to use the ab initio obtained information 
from a single dipeptide to include it in an effective potential for simulating 
polypeptides. There is no point in integrating out the Ramachandran angles </> 
and ip, since the conformation of the larger system will depend crucially on their 
particular values, because they lie in the backbone of the molecule and there 
are as many pairs {(f>, ip) residues in the chain. The side chain angle on 
the contrary, will only influence its immediate surroundings and its importance 
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could be of different magnitude depending on the treatment that the side chains 
are given in the model for the polypeptide. 

In this context, if we wanted to use an energy function that docs not depend 
on X (in some circumstances, a computational must), we would have to perform 
the integral in the last term of eq.lHand use -F'((/>, tp) instead of V2(0, ip), since, as 
it has already been remarked, x is not a hard coordinate and the approximations 
needed to write eq. [2] do not hold. Therefore, if we compare the last term in 
eq. [5] with the last term in eq. [U we see that, apart from additive constants 
that do not depend on cf> and ■0 and that come from the multiplicative constants 
omitted, the PES V2{4>, ip) must be understood as a candidate for approximating 
the more realistic F{(f>,ip) and saving much computational effort. 

To assess the goodness of this approximation in the particular case of formyl- 
alanine-amide is what will be done in the following subsections. 

4.2 Methods 

The ab initio quantum mechanical calculations have been done with the pack- 
age GAMESS [9] under Linux. The coordinates used for the HCO-L-Ala-NHa 
dipeptide in the GAMESS input files and the ones used to "move" the molecule 
in the the automatic Perl scripts that generated the input files are the SASMIC 
defined in sees. [2] and [3l They are presented in table |4] indicating the name 
of the conventional dihedral angles (see also fig. [5] for reference) . In the en- 
ergy optimizations, on the contrary, they have been converted to Delocalized 
Coordinates [4] to accelerate convergence. 

First, we have calculated the typical PES V2 {(f>, "0) defined in eq.[T]in a regular 
12x12 grid, with both 4> and tp ranging from —165° to 165° in steps of 30°. This 
has been done by running energy optimizations at the RHF/6-31+G(d) level of 
the theorjf^, freezing the two Ramachandran angles at each value on the grid, 
starting from geometries previously optimized at a lower level of the theory 
and setting the gradient convergence criterium to 0PTT0L=0.0001 and the self- 
consistent Hartree-Fock convergence criterium to CONV=0. 00001. 

Then, at each grid point, we have defined another one-dimensional grid in 
the coordinate x that ranges from Xo('/'jV') ~ 50° to Xo(0, V') + 60° in steps 
of 10°, where xo{4't''P) is one of the three equivalent equilibrium values (se- 
lected arbitrarily) of this degree of freedom at each point of the original PES. 
This partition in 12 points spans one third of the x-space, but it is enough 
for computing the integrals because the surface V3(0, ip, x) has exact three- fold 
symmetry in x (note, for example, that the value of V3 at Xo('/'j "0) ~ 60° would 
be equal to the one at Xo(0j "0) + 60°). Next, we have run energy optimizations, 
with the same parameters described above and at the same level of theory, at 
each point of the x-grid for every grid- value of the PES (i.e., freezing the three 
angles). The starting geometries have been automatically generated via Perl 

^^Since this is only an exploratory study, the RHF/6-31-^G(d) level of the theory is con- 
sidered enough to show the general trend and to illustrate the procedure to be followed in a 
more exhaustive assessment with more reliable ab initio methods. 
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Atom name Bond length Bond angle 



Dihedral angle 



Hi 

C2 (2,1) 



N3 (3,2) (3,2,1) 

O4 (4,2) (4,2,3) 

C5 (5,3) (5,3,2) 

He (6,3) (6,3,2) 

C7 (7,5) (7,5,3) 

Cg (8,5) (8,5,3) 

Hg (9,5) (9,5,3) 

Hio (10,8) (10,8,5) 

Hii (11,8) (11,8,5) 

H12 (12,8) (12,8,5) 

Ni3 (13,7) (13,7,5) 

Oi4 (14,7) (14,7,5) 

Hi5 (15,13) (15,13,7) 

H16 (16,13) (16,13,7) 



uji :=(15, 13,7,5) 

(16,13,7,15) 



X :=(10,8,5,3) 

(11,8,5,10) 
(12,8,5,10) 
^ :=(13,7,5,3) 

(14,7,5,13) 



(4,2,3,1) 
LOO :=(5,3,2,1) 

(6,3,2,5) 
0:=(7,5,3,2) 

(8,5,3,7) 
(9,5,3,7) 



Table 4: Internal coordinates in Z-matrix form of the protected dipeptide HCO-L- 
Ala-NH2, following the special rules for polypeptides. Principal dihedrals are indicated 
in bold face and their typical biochemical name is given. 

scripts taking the final geometries in the (</), ■(/')-grid and systematically chang- 
ing X- Note that this amounts to only changing the principal dihedral (10,8,5,3) 
in the Z-matrix in table [4l with poorly designed coordinates that did not sep- 
arate the hard modes from the soft ones, this process would have been more 
difficult and rather unnatural. 

After all the optimizations (~ 54 days of CPU time in 3.20 GHz PIV 
machines), we have 12x12x12=1728 points with grid coordinates (f^i, "^ij Xfe) 
i, J, A: = 1 ... 12 of the function V3((/), ip, x) ^'^d we may approximate the integral 
defining F{(f>,ip) in cq. [6]by a finite sum: 



Where additive constants arising from the three-fold symmetry in the coor- 
dinate X have been discarded. 

The harmless quantity {V3){(j),ip), defined as: 






k 
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has been introduced in order for the vahies of the exponential function to be 
in the precision range of the computer. 

Analogously, the average energy may be computed via: 



k 



k 

(9) 



E 



g-/9[V3Wfc>,,Xfc)-(V3>(0.,V'.)] 



k 

And, finally, we extract the entropy from: 

F{c^,,i^,) = Ui(t>^,^j) ~ V,) . (10) 

Additionally, apart from the calculations needed to integrate out x-i we have 
also performed an unconstrained geometry optimization in the basin of attrac- 
tion of the local minima of the PES normally known as 7l or C7eq depending 
on the author [26]. This calculation was done at the MP2/6-31-I— |-G(d,p) level 
of the theory and with the same values of the variables OPTTOL and CONV than 
the ones used in the PES case. The starting geometry was the final structure 
corresponding to the point (—75°, 75°) of the PES calculations at the lower level 
of the theory described in the preceding paragraphs. 

In the local minimum found, we have computed the Hessian matrix (also at 
MP2/6-3H— |-G(d,p)) in two different sets of coordinates: the properly defined 
SASMIC shown in table [4] and an ill-defined set in which the lines corresponding 
to the hydrogens Hio, Hn and H12 in the side chain have been substituted by 
those in table [TJ This is done to numerically illustrate the better separation 
of the hard and soft modes achieved by the internal coordinates defined in this 
work. 



4.3 Results 

In order to assess if V2(0,'0) could be considered a good approximation of 
F{(j),'ijj), we have used a statistical quantity, defined in [29], which measures 
the typical error that one makes in the energy differences between arbitrary 
pairs of conformations of the system if one effective potential is used instead of 
the other. If we measure this distance between F{(j)^il>) and V2{4>,^), using the 
144 points in the ((/>, ■0)-grid, we obtain: 

d{F,V2) = Qmd, RT . (11) 

We present the result in units of RT (at 300° K, where RT ~ 0.6 kcal/mol) 
because it has been argued in [29] that, if the distance between two different 
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F (kcal/mol) 



-TS (kcal/mol) 



20.0 



10.0 



15.0 



0.0 



3.0 




(a) 



(b) 



Figure 7: Ramachandran plots of (a) the free energy F{(j), ip) and (b) —TS{(f), tp) in 
the model dipeptide HC0-L-Ala-NH2. 

approximations of the energy of the same system is less than RT, one may safely 
substitute one by the other without altering the relevant physical properties. In 
this case, this criterium is widely satisfied. Moreover, if one assumes that the 
effective energy studied will be used to construct a polypeptide potential and 
that the latter will be designed as simply the sum of mono-residue ones (making 
each term suitably depend on different pairs of Ramachandran angles), then, the 
number N^es of residues up to which one may go keeping the distance between 
the two approximations of the the TV-residue potential below RT is (see ref. [29] ) : 



The goodness of the approximation in this case is much due to the simplicity 
and small size of the side chain of the alanine residue and also to the fact that the 
dipeptide is isolated. For bulkier residues included in polypeptides, we expect 
the difference between F{4>,'ip) and V2(0,V') to be more important. 

Although the essential result is the one stated in the previous paragraphs, we 
wanted to look in more detail at the origin of the differences between F(^, ■;/)) and 
V2{(j>,'4')- For this, we have first subtracted from F{(t),i(j), U{(j),4') and V2{(f>,il') 
the same constant reference (min ^"(0, '0)13 order to render the numerical 
values more manageable and to minimize the statistical error of the y-intercept 
in the linear fits [30, 31] that will be made in the following. 

Then, fitting U{(f>,ip) against V2((/>, '!/')j have found that they are more 
correlated than F{(f),ip) and V2 (</),')/') (compare the Pearson's correlation coef- 
ficient, r{U,V2) = 0.999999 vs. r{F,V2) = 0.999954, and the aforementioned 
distance, d{U,V2) = 0.015 RT vs. d{F,V2) = 0.098 RT), and that they are 
separated by an almost constant offset: V2 (</),■!/') is ~ 0.3 kcal/mol lower that 
U{(l),ip) (on the other hand, V2{4>,ip) is ^ 0.6 kcal/mol higher than F{(l),tp)). 

^■^At the level of the theory used in the calculations, the minimum of F(if>, ip) in the grid is 
-414.7985507934 hartree. 




(12) 
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Hence, the three Ramachandran surfaces F{(/), ip), U{(j), tp) and V2(^, tp) are very 
similar, except for an offset. In fig. [7^, F{(p,'ip) is depicted graphically and, in 
fig [51 the relative offsets among the three energies are schematically shown. 



0.3 kcal/mol 



0.6 kcal/mol 



0.9 kcal/mol ~ {TS{(f),ij)] 



Figure 8: Relative offsets among the thermodynamical surfaces involved in the study. 

Contrarily, the entropy (we use TS^cj), ip) in order to deal with quantities 
that have units of energy), which may be found in fig. [7t), and whose average 
magnitude is ^ 0.9 kcal/mol, is almost uncorrelated with F{(t),ip), U{(l),tp) and 
^2(0,-0), being the correlation coefficients r{TS,F) = 0.382, r{TS,U) = 0.379 
and r{TS,V2) = 0.381, respectively. Hence, given that (i(C/, V2) is almost an 
order of magnitude lower than d{F,V2); it is reasonable to conclude that the 
greatest part of the (little) noise between F{(j>,^) and V2 (</>,■(/') comes from the 
entropic term —TS{(j),ip). This is supported by the fact that the difference 
F{<f),tl') — V2{(l),ip) is highly correlated with TS{(l),tp), being the correlation 
coefficient r{F - V2,TS) = 0.998. 

Finally, and in order to illustrate the better separation of the hard and 
soft modes achieved by the internal coordinates defined in this work, we have 
calculated the Hessian matrix in the minimum 7l (also C7oq) in two different 
sets of coordinates. They are described at the end of scc. l4.2l and they correspond 
to the SASMIC set, defined according to the rules given in sec.[3l and a set in 
which the coordinates that position the hydrogens in the side chain have been 
ill-defined. 

In fig. [HI we present the sub-boxes of the two Hessian matrices corresponding 
to the coordinates defined in tables [2] and [TJ 



Properly defined coordinates 

X Ql 02 



X 
ttl 
a2 



15.74 1.40 
1.40 110.98 
8.71 -54.23 



8.71 
-54.23 
115.37 



71 

72 
73 



Ill-defined coordinates 

7i 72 73 

113.49 -55.55 -52.60 
-55.55 110.98 -54.23 
-52.60 -54.23 115.37 



Figure 9: Sub-boxes of the Hessian matrix in the minimum 7l (also C7cq) corre- 
sponding to the coordinates defined in tables [2] and [1] The quantities are expressed in 
kcal/mol ■ rad~^. See the text for more details. 
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From the values shown, one can conclude that, in the "properly defined coor- 
dinates", some convenient characteristics are present: on one side, the relatively 
low values of the elements ff^cj and H^a^ (and their symmetric ones) indicate 
that the soft degree of freedom x and the hard ones ai and q;2, which describe 
the internal structure of the methyl group, are uncoupled to a reasonable extent; 
on the other side, the relatively low value of i?^^ compared to Ha^ai and Ha^a2 
(a difference of almost an order of magnitude) proves that x may be regarded 
as soft when compared to the hard degrees of freedom ai and a2- 

On the contrary, in the "ill-defined coordinates", the three dihedrals are 
hard, considerably coupled and equivalent. 

5 Conclusions 

Extending the approach of refs. [11] and the ideas stated in [1,2, 19], we have 
defined a systematic numeration of the groups, the atoms and the internal coor- 
dinates (termed SASMIC) of organic molecules and, particularly, of polypeptide 
chains. The advantages of the rules herein presented are many-fold: 

• The internal coordinates may be easily cast into conventional Z-matrix 
form and they can be directly implemented into quantum chemical pack- 
ages. 

• The algorithm for numbering allows for automatizing and facilitates the 
coding of computer applications. 

• The modularity of the numeration system in the case of polypeptides per- 
mits the addition of new residues without essentially changing the already 
numbered items. This is convenient if databases of peptide structures need 
to be designed. 

• The set of internal coordinates defined reasonably separate the hard and 
soft movements of organic molecules for arbitrary conformations using 
only topological information. 

A number of Perl scripts that automatically generate these coordinates for 
polypeptide chains arc provided as supplementary material. They may be found 
at http : //neptuno .unizar . es/f iles/public/gen_sa smic/. 

In addition, we have used the coordinates herein defined and ab initio Quan- 
tum Mechanics to assess the approximation of the free energy obtained from 
averaging out the rotameric degree of freedom x via the conventional PES in 
the protected dipeptide HC0-L-Ala-NH2. Applying the criterium in rcf. [29], we 
have found that approximating F{(f>,ip) by V2((/>, "0) is justified up to polypep- 
tides of medium length (~ 100 residues) and much computational effort may 
be saved using the PES instead of the more realistic free energy. However, the 
small size of the side chain of the alanine residue and the fact that the dipeptide 
is isolated do not allow to extrapolate this result. For bulkier residues included 
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in polypeptides, we expect the difference between F{(j),'i{}) and V2{(j),ip) to be 
more important. 
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Aliphatic 
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He O1B 
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H / 
H-Xf H„ 
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\ -i: 

H.,-0Q)C.-H. 
H2 \ ,^^Hi2 



PhenjtUaiiijie - F - Phe 



,H„ 

Ha^ \ .^Hi2 

>. H, Y 



T>rDsme - Y-Tjr 



Ha / H 
He 

Toptophan - W - Trp 



Histidine-H-His 



^Ci6 ^Hi7 
/ 

Hi5',„ r 
a Hi ^'^ 
I I: 

He 

Ljsine - K - Lys 



f Hs >^ 



\..^H,. 

j?4 Hs His 

I li 

He Oi7 
Aspartic Add - D - Sap 



Ol7 / 

His'.,,,../ 
H ^1° 

\ -J*""" 

a H, Hi= 

I II 

hi Qm 
Glulamic.4dd - E - Glu 



Figure 10: Numeration of the left-handed dipeptides HCO-L-X-NH2, where X runs 
on the twenty naturally ocurring amino acids (except for Glycine, which is the achiral 
species HCO-Gly-NH2). Uncharged side chains are displayed and Histidine is shown 
in its 5 tautomeric form. The rules used for numbering are the special version for 
polypeptides. 
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